The Evolution of MLLM Architectures
The evolution of Multimodal Large Language Models (MLLMs) marks a shift from modality-specific silos toward a unified representation space, in which non-text signals (images, audio, 3D) are converted into semantic forms that a language model can understand.
1. From Vision to Multi-Sensory
- Early MLLMs: focused primarily on Vision Transformers (ViT) for image-text tasks.
- Modern architectures: integrate audio (e.g., HuBERT, Whisper) and 3D point clouds (e.g., Point-BERT) to achieve truly cross-modal intelligence.
2. Projection Bridging
To connect different modalities to the language model, a mathematical bridging mechanism is required:
- Linear projection: the simple mapping used in early models (e.g., MiniGPT-4):
$$X_{llm} = W \cdot X_{modality} + b$$
- Multi-layer MLP: a two-layer approach (e.g., LLaVA-1.5) whose nonlinear transformation yields better alignment for complex features.
- Resampler/Abstractor: advanced modules such as the Perceiver Resampler (Flamingo) or the Q-Former, which compress high-dimensional inputs into a fixed-length set of tokens.
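The three projection styles above can be sketched in PyTorch. The dimensions here (a 1024-d modality feature, a 4096-d LLM space, 32 resampler queries) are illustrative assumptions, not values from any particular model:

```python
import torch
import torch.nn as nn

D_MOD, D_LLM, N_QUERIES = 1024, 4096, 32  # illustrative sizes (assumptions)

# 1) Linear projection (MiniGPT-4 style): X_llm = W @ X_modality + b
linear_proj = nn.Linear(D_MOD, D_LLM)

# 2) Two-layer MLP (LLaVA-1.5 style): the nonlinearity improves alignment
mlp_proj = nn.Sequential(
    nn.Linear(D_MOD, D_LLM),
    nn.GELU(),
    nn.Linear(D_LLM, D_LLM),
)

# 3) Resampler: learned queries cross-attend to the modality features,
#    compressing a variable-length sequence to a fixed N_QUERIES tokens.
class Resampler(nn.Module):
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(N_QUERIES, D_LLM))
        self.proj = nn.Linear(D_MOD, D_LLM)
        self.attn = nn.MultiheadAttention(D_LLM, num_heads=8, batch_first=True)

    def forward(self, x):                        # x: (batch, seq, D_MOD)
        kv = self.proj(x)                        # (batch, seq, D_LLM)
        q = self.queries.expand(x.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)            # (batch, N_QUERIES, D_LLM)
        return out

feats = torch.randn(2, 257, D_MOD)               # e.g. ViT patch features
print(linear_proj(feats).shape)                  # (2, 257, 4096)
print(mlp_proj(feats).shape)                     # (2, 257, 4096)
print(Resampler()(feats).shape)                  # (2, 32, 4096)
```

Note how the linear and MLP projections preserve sequence length, while the resampler always emits exactly N_QUERIES tokens regardless of input length.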
3. Decoding Strategies
- Discrete tokens: represent outputs as entries in a dedicated vocabulary (e.g., VideoPoet).
- Continuous embeddings: use "soft" signals to steer specialized downstream generators (e.g., NExT-GPT).
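The two decoding strategies can be contrasted in a minimal sketch. The codebook size and conditioning dimension below are assumptions for illustration:

```python
import torch
import torch.nn as nn

D_LLM, CODEBOOK = 4096, 8192            # illustrative sizes (assumptions)
hidden = torch.randn(2, 16, D_LLM)      # LLM output states for 16 positions

# Discrete tokens (VideoPoet style): score each position against a modality
# codebook and pick one entry; a separate decoder maps IDs back to media.
to_codebook = nn.Linear(D_LLM, CODEBOOK)
token_ids = to_codebook(hidden).argmax(dim=-1)   # (2, 16) integer IDs

# Continuous embeddings (NExT-GPT style): keep the states as "soft" signals
# and hand them to a downstream generator as conditioning vectors.
to_condition = nn.Linear(D_LLM, 768)             # assumed generator dim
condition = to_condition(hidden)                 # (2, 16, 768) float tensor

print(token_ids.shape, condition.shape)
```

The trade-off: discrete tokens are easy to train with standard cross-entropy but quantize away detail, while continuous embeddings preserve detail but depend on the downstream generator's conditioning interface.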
The Projection Rule
For a language model to process sound or 3D objects, the signal must be projected into the model's existing semantic space, so that it is treated as a "modality signal" rather than noise.
Question 1
Which projection technique is generally considered superior to a simple Linear layer for complex modality alignment?
Question 2
What is the primary role of ImageBind or LanguageBind in this architecture?
Challenge: Designing an Any-to-Any System
Diagram the flow for an MLLM that takes an Audio input and generates a 3D model.
You are tasked with architecting a pipeline that allows an LLM to "listen" to an audio description and output a corresponding 3D object. Define the three critical steps in this pipeline.
Step 1
Select the correct encoder for the input signal.
Solution:
Use an Audio Encoder such as Whisper or HuBERT to transform the raw audio waves into feature vectors.
Step 2
Apply a Projection Layer.
Solution:
Pass the audio feature vectors through a Multi-layer MLP or a Resampler to align them with the LLM's internal semantic space (dimension matching).
Step 3
Generate and Decode the output.
Solution:
The LLM processes the aligned tokens and outputs "Modality Signals" (continuous embeddings or discrete tokens). These signals are then passed to a 3D-specific decoder (e.g., a 3D Diffusion model) to generate the final 3D object.
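Putting the three steps together, the pipeline can be stubbed out as below. Every module here is a placeholder standing in for the real components (a Whisper/HuBERT encoder, a full LLM, a 3D diffusion decoder), and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

D_AUDIO, D_LLM, D_3D = 768, 4096, 512   # illustrative sizes (assumptions)

class AudioTo3DPipeline(nn.Module):
    def __init__(self):
        super().__init__()
        # Step 1: audio encoder (stand-in for Whisper/HuBERT feature frames)
        self.audio_encoder = nn.Conv1d(1, D_AUDIO, kernel_size=400, stride=320)
        # Step 2: two-layer MLP projection into the LLM's semantic space
        self.projector = nn.Sequential(
            nn.Linear(D_AUDIO, D_LLM), nn.GELU(), nn.Linear(D_LLM, D_LLM))
        # LLM backbone (placeholder: a single transformer layer)
        self.llm = nn.TransformerEncoderLayer(D_LLM, nhead=8, batch_first=True)
        # Step 3: head emitting modality signals for a 3D generator (stub)
        self.signal_head = nn.Linear(D_LLM, D_3D)

    def forward(self, waveform):                      # (batch, 1, samples)
        feats = self.audio_encoder(waveform)          # (batch, D_AUDIO, frames)
        tokens = self.projector(feats.transpose(1, 2))  # (batch, frames, D_LLM)
        hidden = self.llm(tokens)
        return self.signal_head(hidden)   # conditioning for the 3D decoder

wave = torch.randn(2, 1, 16000)           # ~1 s of 16 kHz audio per sample
signals = AudioTo3DPipeline()(wave)       # (2, frames, 512) modality signals
```

In a real system, `signals` would condition a 3D-specific generator rather than being the final output; the LLM never produces geometry directly, only the aligned semantic signals that steer the decoder.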